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Learning to classify with possible sensor failures 
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Abstract —In this paper, we propose a general framework 
to learn a robust large-margin binary classifier when corrupt 
measurements, called anomalies, caused by sensor failure might 
be present in the training set. The goal is to minimize the gener¬ 
alization error of the classifier on non-corrupted measurements 
while controlling the false alarm rate associated with anomalous 
samples. By incorporating a non-parametric regularizer based 
on an empirical entropy estimator, we propose a Geometric- 
Entropy-Minimization regularized Maximum Entropy Discrim¬ 
ination (GEM-MED) method to learn to classify and detect 
anomalies in a joint manner. We demonstrate using simulated 
data and a real multimodal data set. Our GEM-MED method can 
yield improved performance over previous robust classification 
methods in terms of both classification accuracy and anomaly 
detection rate. 

Index Terms —sensor failure, robust large-margin training, 
anomaly detection, maximum entropy discrimination. 


If not robustified to anomalous measurements, classification 
algorithms may suffer from severe degradation of perfor¬ 
mance. Therefore, when anomalous samples are likely, it 
is crucial to incorporate outlier detection into the classifier 
design. This paper provides a new robust approach to design 
outlier resistant large margin classifiers. 



I. Introduction 

Large margin classifiers, such as the support vector machine 
(SVM) |Tj and the maximum entropy discrimination (MED) 
classifier [2], have enjoyed great popularity in the signal 
processing and machine learning communities due to their 
broad applicability, robust performance, and the availability 
of fast software implementations. When the training data is 
representative of the test data, the performance of MED/SVM 
has theoretical guarantees that have been validated in practice 
0 0 0 Moreover, since the decision boundary of the 
MED/SVM is solely defined by a few support vectors, the 
algorithm can tolerate random feature distortions and pertur¬ 
bations. 

However, in many real applications, anomalous measure¬ 
ments are inherent to the data set due to strong environmen¬ 
tal noise or possible sensor failures. Such anomalies arise 
in industrial process monitoring, video surveillance, tactical 
multi-modal sensing, and, more generally, any application that 
involves unattended sensors in difficult environments (Fig. [I]). 
Anomalous measurements are understood to be observations 
that have been corrupted, incorrectly measured, mis-recorded, 
drawn from different environments than those intended, or 
occurring too rarely to be useful in training a classifier 0. 
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Fig. 1. Due to corruption in the training data the training and testing sample 
distributions are different from each other, which introduces errors into the 
decision boundary. 


A. Problem setting and our contributions 

We divide the class of supervised training methods into four 
categories, according to how anomalies enter into different 
learning stages. 


TABLE I 

Categories for supervised training algorithms via different 

ASSUMPTION OF ANOMALIES 



Training set (uncorrupted) 

Training set (corrupted) 

Test set 
(uncorrupted) 

classical learning 
algorithms (e.g. SVM (6j, 
MED |2j) 

Robust classification & 
training (e.g. ROD j7], 

this paper) 

Test set 
(corrupted) 

anomaly detection (e.g. 

MV (8), and GEM 0, 

_ in _ 

Domain adaptation & 
transfer learning (e.g. 

_ GD> _ 


As shown in Table |T| a majority of learning algorithms 
assume that the training and test samples follow the same 
nominal distribution and neither are corrupted by anomalies. 
Under this assumption, an empirical error minimization algo¬ 
rithm can achieve consistent performance on the test set. In the 
case that anomalies exist only in the test data, one can apply 
anomaly detection algorithms, e.g. 0-G9. (D to separate 
the anomalous samples from nominal ones. Under additional 
assumptions on the nominal set, these algorithms can effec¬ 
tively identify an anomalous sample under given false alarm 
rate and miss rate. Furthermore, in the case that both training 
and test set are corrupted, possibly with different corruption 
rate, domain adaptation or transfer learning methods may be 
applied UQ © 

This paper falls into the category of robust classification 
& training in which possibly anomalous samples occur in 
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the training set while the test set remain uncorrupted. Such 
a problem is relevant, for example, when high quality clean 
training data is too expensive or too difficult to obtain. The 
case where both the training and test data might be corrupted 
will not be treated in this paper. Our goal is to train a classifier 
that minimizes the generalization error with respect to the 
nominal test distribution, even though the training set may 
be corrupted. 

The theory of robust classification has been thoroughly 
investigated g, jig -jT8). Tractable robust classifiers 
that identify and remove outliers, called the Ramp-Loss based 
learning methods, have been studied in 0 , (T9), (20). 
Among these methods, Xu et al. 0 proposed the Robust- 
Outlier-Detection (ROD) method as an outlier detection and 
removal algorithm using the soft margin framework. Training 
the ROD algorithm involves solving an optimization problem, 
for which dual solution is obtained via semi-definite program¬ 
ming (SDP). Like all the Ramp-Loss based learning models, 
this optimization is non-convex requiring random restarts to 
ensure a globally optimal solution (5), [18]. In this paper, in 
contrast to the models above, a convex framework for robust 
classification is proposed and a tractable algorithm is presented 
that finds the unique optimal solution of a penalized entropy- 
based objective function. 

Our proposed algorithm is motivated by the basic princi¬ 
ple underlying the so-called minimal volume (MV) /minimal 
entropy (ME) set anomaly detection method 0. ©-[©• 
Such methods are expressly designed to detect anomalies 
in order to attain the lowest possible false alarm and miss 
probabilities. In machine learning, nonparametric algorithms 
are often preferred since they make fewer assumptions on 
the underling distribution. Among these methods, we focus 
on the Geometric Entropy Minimization (GEM) algorithm 
0’ p| . This algorithm estimates the ME set based on the 
k-nearest neighbor graph (k-NNG), which is shown to be 
the Uniformly Most Powerful Test at given level when the 
anomalies are drawn from an unknown mixture of known 
nominal density and uniform anomalous density j9j. A key 
contribution of this paper is the incorporation of the non¬ 
parametric GEM anomaly detection into a binary classifier 
under a non-parametric corrupt-data model. 

The proposed framework, called the Geometric-Entropy- 
Minimization regularized by Maximum Entropy Discrimination 
(GEM-MED), follows a Bayesian perspective. It is an exten¬ 
sion of the well-established Maximum Entropy Discrimination 
(MED) approach proposed by Jaakkola et al. [2|. MED per¬ 
forms Bayesian large margin classification via the maximum 
entropy principle and it subsumes SVM as a special case. The 
MED model can also solve the parametric anomaly detection 
p) problem and has been extended to multitask classification 
|2l| . However, the application of MED to robust classification 
has remained open. In this paper, the proposed GEM-MED 
model, a fusion of GEM and MED, fills this gap by explicitly 
incorporating the anomaly detection false-alarm constraint 
and the mis-classification rate constraint into a maximum 
entropy learning framework. As a Bayesian approach, GEM¬ 


MED requires no tuning parameter as compared to other 
anomaly-resistant classification approaches, such as ROD 0. 
We demonstrate the superior performance of the GEM-MED 
anomaly-resistant classification approach over other robust 
learning methods on simulated data and on a real data set 
combining sensor failure. The real data set contains human- 
alone and human-leading-animal footsteps, collected in the 
field by an acoustic sensor array p2)-p4[. 

B. Organization of the paper 

What follows is a brief outline of the paper. In Section II, we 
review MED as a general framework to perform classification 
and other inference tasks. The proposed combined GEM¬ 
MED approach is presented in Section III. A variational 
implementation of GEM-MED is introduced in Section IV. 
Experimental results based on synthetic data and real data 
are presented in Section V. Our conclusions are discussed in 
Section [Vll 

II. From MED to GEM-MED: A general routine 

Denote the training data set as V t := {( y n , x n )} nG T> where 
each sample-pair (y n ^x n ) G ^ x T = V are independent. 
Denote the feature set X d VP and the label set as y. For 
simplicity, let y = { — 1,1}. The test data set is denoted as 
V s := {x m } mG s. We assume that {(y n , x n )} neT are i.i.d. 
realizations of random variable (Y,X) with distribution Vt, 
conditional probability density p(X\Y = y,Q) and prior 
p(Y = y),y G y, where O is the set of unknown model 
parameters. We denote by p(Y = y,X\ O) = p(X\Y = 

2 /, O )p(Y = y) the parameterized joint distribution of (Y,X). 
The distribution of test data, denoted as V s , is defined sim¬ 
ilarly. Vnom denotes the nominal distribution. Finally, we 
define the probability simplex over the space y x X. 

A. MED for classification and parametric anomaly detection 

The Maximum entropy discrimination (MED) approach to 
learning a classifier was proposed by Jaakkola et al (2). The 
MED approach is a Bayesian maximum entropy learning 
framework that can either perform conventional classification, 
when Vt = V s = P nom , or anomaly detection, when Vt 
and Vt = Vnom • In particular, assume that all parameters 
in 0 are random with prior distribution po(@)« Then MED 
is formulated as finding the posterior distribution q(Q) that 
minimizes the relative entropy 

K1 (?(©) || Po(©)) := I log Q(dS) (1) 

subject to a set of P constraints on the risk or loss: 

J £i ip, iVn, X n ); ©) q(dO) < 0, Vn e T, 1 < i < P. (2) 

The constraint functions {Ci}f =1 can correspond to losses 
associated with different type of errors, e.g. misdetection, false 
alarm or misclassification. For example, the classification task 
defines a parametric discriminant function Tc • A^ x x x V -A 
M + as 

fcip, (y n ,x n );Q) :=\ogp(Y = y n \x n ;0)/p(Y ^ y n |*„;©). 
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In the case of the SVM classification, the loss function is 
defined as 

fc-'i ~ £c (jP-i (Urn &n)’i 0) - = [Cn C {jPi (Urn ©)] • 

(3) 

Other definitions of discriminant functions are also possible 

© 

An example of an anomaly detection test function Ci = 
jCd : Ax x X —y M, is 

C D (p,x n -,Q) := ~[\ogp(x n ;Q) -/3], (4) 

where p(x n ;Q) is the marginal likelihood p(x n ; (-)) = 

J2y n eyP( x = x n\ Y = Vn,Q)p{Y = y n )- The constraint 
function Q has the interpretation as local entropy of X in the 
neighborhood of X — x n . Minimization of the average con¬ 
straint function yields the minimal entropy anomaly detector 
0 p| . The solution to the minimization ^ yields a poste¬ 
rior distribution p(Y = y\x n , 0) where 0 := 0 U {£ n } U {/ 3 }. 
This lead to a discrimination rule 

y* = argmin y j- J log p(y,x m ; 0)g(d0)|, x m G D s . 

(5) 

when applied to the test data V s . 

The decision region {x G X\Y = y} of MED can have 
various interpretations depending on the form of the constraint 
function 0 and 0. For the anomaly detection constraint 
0, it is easily seen that the decision region is a /3-level-set 
region for the marginal p(x; 0), denoted as ^ p := {x n G | 
\ogp(x n ; 0) > /3}. Here ^ p is the rejection region associated 
with the test: declare x m G V s as anomalous whenever 
x m 0 ^/ 3 ; and declare it as nominal if x m G ^p. With 
a properly-constructed decision region, the MED model, as 
a projection of prior distribution p o (0) into this region, can 
provide performance guarantees in terms of the error rate or 
the false alarm rate and can result in improved accuracy ED 

ED 

Similar to the SVM, the MED model readily handles non- 
parametric classifiers. For example, the discriminant function 
Tc (p, (y, x), 0) can take the form y[Q(x)] where 0 = / 
is a random function, and f : X y can be specified by 
a Gaussian process with Gaussian covariance kernel 
More specifically, / G H, where H is a Reproducing Ker¬ 
nel Hilbert Space (RKHS) associated with kernel function 
K : X x X See ED for more detailed discussion. 

MED utilizes a weighted ensemble strategy that can im¬ 
prove the classifier stability [2]. However, like SVM, MED is 
sensitive to anomalies in the training set. 

B. Robustified MED when there is an anomaly detection 
oracle 

Assume an oracle exists that identifies anomalies in the 
training set. Using this oracle, partition the training set as V t = 

jynom^jyanm^ whefe ( x ^y n ) „ Vnom if (x n ,y n ) G 

and (x n ,y n ) / Vnom, if (x n ,y n ) G V? nm . Given the oracle, 
one can achieve robust classification simply by constructing a 


classifier and an anomaly detector simultaneously on 
This results in a naive implementation of robustified MED as 

min KIL (q(Q) || p o (0)) (6) 

</(©)€ a 5 

s.t. J C G (p, (y n ,x n ); 0) q(dQ) < 0, (x n ,y n ) G Pf om , 

J C D (p,* n ;0) q(dG) < 0, (x n ,y n ) G 

where 0 = ©U{/3}U{^„} raeT , the large-margin error function 
Cc is defined in 0 and the test function Cd is defined in ([4]). 
The prior is defined as p 0 (Q) = Po{Q)po{(3) Yl neT Po(^n)- 

Of course, the oracle partition V t = V^ orn U V^ nrn is not 
available a priori. The parametric estimator 4/^ of p can be 
introduced in place of in ([6]). However, the estimator 

is difficult to implement and can be severely biased if 
there is model mismatch. Below, we propose an alternative 
nonparametric estimate of the decision region 4 /p that learns 
the oracle partition. 

III. The GEM-MED: model formulation 

A. Anomaly detection using minimal-entropy set 

As an alternative to a parametric estimator of the level- 
set T ' / 3 := {x m G X\\ogp(x m -Q) > /?}, we pro- 

pose to use a non-parametric estimator based on the 
minimal-entropy (ME) set Cli-p. The ME set Di _p := 
arg minx{H(A)\ f A p(x)dx > /3} is referred as the minimal- 
entropy-set of false alarm level 1 — /?, where H (A) = 
— f A logp(x) p(x)dx is the Shannon entropy of the density 
p(x) over the region A. This minimal-entropy-set is equivalent 
to the epigraph-set {A : f A p(x)dx > /?} as illustrated in Fig. 

m 



Fig. 2. The comparison of level-set (left) and the epigraph-set (right) w.r.t. 
two continuous density function p(x). The minimum-entropy-set is computed 
based on the epigraph-set. 

Given Qi-p, the ME anomaly test is as follows: a sample 
x n is declared anomalous if x n 0 £l\-p\ and it is declared 
nominal, when x n G £l\-p. It is established in Q that when 
p(pc) is a known density, this test is a Uniformly Most Powerful 
Test (UMPT) at level (3 of the hypothesis H 0 : x ~ p(x) vs. 
Hi : x ~ p(x) + eU(x) 9 where U(x) is the uniform density 
and e G [0,1] is an unknown mixture coefficient. 
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In@, pol l, the Geometric Entropy Minimization (GEM) is 
proposed and the ME set ft 1-/3 is estimated by Tli-p using 
a K-point minimal spanning tree @ or a bipartite K-point 
k-nearest neighbor graph (BP-kNNG) pQ| . 


B. The GEM based on BP-kNNG and relaxation 

The implementation of GEM is accomplished by applying 
the BP-kNNG i fTO) . Specifically, define the training samples 
from class c as D t c := {x n \y n = c} where c G {±1}. Define a 
random binary partition V '£ = V t ,c U D^ ,c , where and 
p)M,t are ^^ 01 ^ As illustrated in Fig. [5] the BP-kNNG is a 
graph connecting these two parts, in which one part is used 
to find the local entropy h n := h(x n ), while the other part is 
used to compute the average entropy within a neighborhood. 
For x n G V^ c , the local entropy h n is estimated by h n = 
d log (d k (x n ,T>™ 


,c )) - log (jGG)’ where d k {x n ,V^' c ) is 

the sum of k-nearest neighbor (kNN) distance from the target 
sample x n to its M c reference samples in d is the 

intrinsic dimension of 
ball in R d [26 ]. 


and Cd is the volume of the unit 


In |T0l , the BP-kNN based algorithm was implemented to 
estimate the ME set Eti-p of coverage probability 1 — /?. 
This was accomplished by solving the following discrete 
optimization problem: 


A*earg min L(A C ,V™’ C ), 

A c CT>?' c 

where L(A c ,T>f i,c ) := E d k {x n ,V^' c ), 


X n G-A c 


and where A c is a set of distinct K = \T\ (1 — /3) points 
in V^ c . It is shown in [10] that A* = fli-p is an asymp¬ 
totically consistent estimator of the ME set. Equivalently, let 
77 n G {0,1} be the indicator function of the event x n G A c and 
define d n := dkixmVf 1 ’ 0 ). Then the algorithm in [10] finds 
the optimal binary variables ^rj n G {0,l}|x n G j ,n = 
1,..., TV, that minimize 


E drAi 

x„eT>?’ c 


subject to 


E T}n > K. 


(7) 


To adapt the BP-kNN implementation of GEM to our frame¬ 
work, the binary weights ij n G {0,1} are relaxed to continuous 
weights in the unit interval [0,1] for all n G T . After 
relaxation, the constraint in ([7]) becomes J2 n Pn/\T\ > ft, 
where /3 = K/ |Tj = (1 — /3) > 0 is set so that the optimal 
solution {r] n \x n G A*} is feasible and the all-zero solution 
is infeasible.With the set of weights { rj n } ne T * ^e GEM 
problem in 0 can be transformed into a set of nonparametric 
constraints that fit the framework 0. This is discussed below. 


C. The GEM-MED as non-parametric robustified MED 

Now we can implement the framework in 0- Denote 

© := © U |/3 j U {Cn} nG T © {Vn} ne T © {7*} ze {±i}, where 

{£n} ne T 

weights in Sec. 
defined later. 


/ van liiiui^iiiuin uiv xi 

|/^| © {^n} n eT © {hn} 
are param eters as defined in ([6j, {hn} ne T are 
and /3, {7^}^e{±i} are var iables to be 


III-B 



Fig. 3. Figure (a) illustrates ellipsoidal minimum entropy (ME) sets for 
two dimensional Gaussian features in the training set for class 1 (orange 
region) and class 2 (green region). These ME sets have coverage probabilities 
1 — (3 under each class distribution and correspond to the regions of maximal 
concentration of the densities. The blue disks and blue squares inside these 
regions correspond to the nominal training samples under class 1 and class 
2, respectively. An outlier (in red triangle) falls outside of both of these 
regions. Figure (b) illustrates the bipartite 2-NN graph approach to identify the 
anomalous point, where the yellow disks and squares are reference samples 
in each class that are randomly selected from the training set. Note that the 
average 2-NN distance for anomalies should be significantly larger than that 
for the nominal samples. 


According to the objective function in 0, we specify the 
test function Cd as 

£d( 0, y; Z, d) := C D ({q n } , { 7 *} , y; z, d) 

*=* ( E 1 A dndn/ Tl - lz ) , z e {±1} , 


where 7 * > 0,z G {±1} is the threshold associated with d n on 
V t n {x n \y n = zj. Compared with 0, if 7 z = L*+ e, where 
L* is the optimal value in ([7]) and e > 0 is small enough, then 
for {rj n } neT satisfying C D < 0, the region {x n : r\ n > §} is 
concentrated on Qi-p D {x n \y n = zj , z e {±1}. 

As discussed in |III-B[ the constraint in 0 becomes the 
inequality constraint J2 n \y n =z 'in/ \T\ > /?. 

Assuming that © is random with unknown distribution 
#(©), the above expected constraints becomes 



,y;z,d)q(dQ ) <0,2£ {±1} , 


Vn/ \T\ 


q(dO) 


> f3,z e {±1}. 


( 8 ) 

(9) 


The constraint 0 is referred as the entropy constraint and 
constraint ([9]) is the epigraph constraint. As discussed above, 
the region {x n \p n > 7} f or q(Q) satisfying 0 and 0 is 
concentrated on Qi-p D {x n \y n = zj in each class 2 G {±1} 
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on average. With Cd , the test constraint 

J C D (p, x n ; 0) q(dQ) < 0, (x n , y n ) G V n t ° m 

in 0 is replaced by ([ 8 ]) and ([9]). 

For the classification part in ([ 6 ]), given rj n associated with 
each sample, the error constraints 


J C c (p, (: y n , *«); 0) <?(d0) < 0, (* n , y n ) e T>? om , 

in <01 is replaced by reweighted error constraints 

J [r)n£c (p, ( y n , *„); ©)] q(dQ) <0, n G T, 

with jCc defined as in 0 . Note that these constraints are 
applied to the entire training set. Summarizing, we have the 
following: 

Definition The Geometric-Entropy-Minimization Maximum- 
Entropy-Discrimination (GEM-MED) method solves 


min KIL (<?(0) || p 0 (©)) 

<?(©)£Aq 


( 10 ) 


s.t. J [rinCc (p, (y n ,Xn); ©)] q(dQ) <0, n G T, 
JC D (Q, y; z, d)q(dQ) < 0 , z e {± 1 } , 

/ 


E wm 


n:y n =z 


q(dS) > ft, z G {±1} 


where 0 , Cc and Cd are defined as before. 


IV. Implementation 

A. Projected stochastic gradient descent algorithm 

Note that © is a convex optimization w.r.t. the un¬ 
known distribution q(O). Therefore, it can be solved using 
the Karush-Kuhn-Tucker (KKT) conditions, which will result 
in a unique solution. We make the following simplifying 
assumptions under which our a computational algorithm is 
derived to solve © 

1) Assume that a kernelized SVM is used for the classifier 
discriminant Tc function. Following (2Q, (27), we 
assume that the decision function / follows a Gaussian 
random process on X, i.e., a positive definite covariance 
kernel K{x^Xj) is defined for all Xi,Xj G X and 
all finite dimensional distributions, i.e., distributions 
of samples ( f(xi))i e T ? follow the multivariate normal 
distribution 


( f( x i))i£T ~ AT (0 ,K), ( 11 ) 

where K = [K(xi,Xj)]ij e T is a specified covariance 
matrix. For example, K(x^Xj) := exp(— 7 ||x^ — Xj HI) 
for Gaussian RBF kernel covariance function. 

2) Assume a separable prior, as commonly used in 
Bayesian inference (2), (27) , (281 

Po( 0 ) = Po( 0 ) II Po(£n) II Po(r)n) f[ Po{lz)- 

nET n£T z£{±l} 

( 12 ) 


3) Assume that the hyperparameters {£ n } are exponential 
random variables and the indicator variables {r] n } are 
independent Bernoulli random variables, 

Po(£n) OC exp(—C ? (l - £„)), G (“OO, 1], 71 G T\ 
Po(jln) = Ber{p v ) 

with P v = 7 —-7—7-77 

1 + exp(-(a^ - T] n )) 

:= cr(a, - r? n ), e {0,1} , n G T; 

Po(7z) = ^,(7z); «G{±1}, (13) 

where (a r/ , c$) are parameters and 7 Z is the upper bound 
estimate for minimal-entropy in each class z = ± 1 given 
by GEM algorithm. a(x) = 1/(1 + exp(— x)) is the 
sigmoid function. 


Now by solving the primal version of optimization problem 
( flOl ), we have 

Result 4.1: The GEM-MED problem in (\0\ is convex 
with respect to the unknown distribution q(Q) and the unique 
optimal solution is a generalized Gibbs distribution with the 
density: 

q(d 0) = -rPo(rf0) exp (-£(©; A, y, k)) , (14) 

Z (A, [A , K) 

where 

^(©; A, k) := £( 0, /?, {£ n } , {r] n } , {7*} 5 A, «) 

^ ^ An7n^C,©,£ n ^ ^ b'z&DiZ 

n£T z£{± 1} 

- E E wth- E 

££{± 1 } n:y n =z z£{± 1 } 

with 0 = 0 U |/3| U {£n} ner U {t?„} ner U { 7 + 1 , 7 -tland 
where the dual variables A = {A n , n G T}, /a = (p z , z G ±1) 
and k = (ftz,z G ±1) are all nonnegative. Z(\,/a,k,) is 
partition function , which is given as 

Z(A,/x,«) = J exp (-£(©; A, p, k,)) p 0 (dQ). (15) 

The factor Cc,@,£ n •= £c(*;0,£n) is defined as in (3), 
: = is defined as in 0 . See the Appendix 

Sec. El for a detailed derivation. 


Moreover, we specify the error function as 


(Pi (Z/n? x n)'i 0? £n)— 2/n/(*^n)j (16) 


where Q := f : X ^ y is sl decision function associated with 


a nonparametric classifier as defined in Sec II-A 


Since the optimization problem is convex, we can equiva¬ 
lently solve a dual version of the optimization problem ( fTO) . 
In fact, we have the following result: 


Result 4.2: Assume that ©, ©, © hold, the dual 
optimization problem is given as 

max — log Z(A, /x, k) (17) 

A,/j,,k>0 

= - log / exp (—£?(©; A, y, k)) p 0 (dO) 
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Algorithm 1 The (kernel) GEM-MED algorithm 

Input: The training set V t C X x {±1} and the test set V s . The projection gradient step parameter 
< p./tp.T > 0. Prior distribution and assumptions given as (11)-(13). The kernel function K : X x X —► R 
is specified. 

1: Initialize: Set /x n = 0, = 0. Ao is set by applying conventional MED on T) 


2: for t = 1,..., T or until converge do 

3: Compute the gradient of log-partition function w.r.t \ L , p t and respectively, i.e. 

r9-ic>s^(A / , ; /x /? #c,) an( j ,/* p i6f) accorc p n g i\ w formula (21)-(23) where the expectation is ap¬ 

proximated via Gibbs sampling described as above. 


1: Update A„_. // 2 and k z via projected gradient descent, i.e. 


\ _ J\ 01ogZ((#i tl At,Kt))\ 

A n ,(t+|) — PrOJ^ A; 0 <A<Ci) | A n ,« — p | 


n E T. 


Pz,(t+ 1 ) 


K 3T.(t + l) 


(m*.* 


Proj {M:(1 > 0) - V 

Proj { 


. d\o%Z{y t , A(,k ( ) 


dn 

J{«:«>0} p«.* &T / 


Z E {-1;+1}, 

where proj| J .. 0<T< ^{u?} = min (max(.r. 0). C) defines the projection of x on the feasible set {z : 0 < 

z<cy 

5: end for 

Output: Assign label for test sample x m € T> s as 

V = \ ^ ^ ‘^nA n 't/n-^(*m? *n) / » 

l^r J 

where r? n = E [i] n \f] at the final iteration of step 4. 


Fig. 4. The proposed GEM-MED algorithm based on the projected stochastic gradient descent |29j. The gradient with respect to dual variables (20)-(23) 
can be approximated via Gibbs sampling as discussed in Sec. |iv] The constraints on the dual variable X n E [0, C\] are imposed by a clipping procedure 
projf^; o<' U ;<c}{' L(; } = min (max(m, 0), C) that is applied on each Gibbs move, similarly to the C-SVM algorithm |30j. The parameters (qp, cp, r) control 
the stepsize of the gradient descent algorithm. 


= (An + log (1 - A n /c)) - ^2 Vzlz + /? Kz 

neT ze{±i} ze{±i} 

- log J exp (l,Q(K © (yy T ), (A © rj ))^ 

x po(rj) exp ( T] T (—fj, © d + k © e)) drj (18) 

where (A, //, k) are nonnegative dual variables as defined in 
e is the all l’s vector, © is Hadamard product, © is the 
Kronecker product, respectively, and 

Q(K. x ) = x r Kx 

is the quadratic form associated with the kernel K. 

See Appendix Sec. [B] for derivations of this result. 

It is seen from Gil that the dual objective function is 
concave w.r.t. dual variables (A,//,*&). However, the integral 
in © is not closed form, so an explicit form as a quadratic 
optimization in SVM is not available. Nevertheless, the only 
coupling in ( p~8] ) comes from the joint distribution q(f,rj). 
In particular, under the prior assumption ©, ©, ©, the 
optimal solution ( fl4l > satisfies 

1) q(6) = q(f,rj) II„ q{in)q{l+i)q{l-i) is factorized. 

2) q(rj\f) = UneT^iVnlf), i-e. the {rj n ,n e T} are condi- 
tional independent given the decision boundary function 


/. Moreover, 

q{Vn\f) = Ber(q v ), (19) 

with q v = a (p n F n (f)) 

where p n := log > Fn(f) ■= 

An [y„/(x„) - 1] -Pyjin + Ky n / \T\ is the 

sigmoid function as ©. 

3) f\y ~ A1(/|/^,a(-). k )> where 

fvM') = ^2 X nVnynK{-,X n ) G U (20) 
neT 

See Appendix Sec. [C] for details. 

Given above results, we propose to use the projected 
stochastic gradient descent (PSGD, p9| ) algorithm to solve 
the dual optimization problem in ( [18] ). The gradient vectors of 
the dual objective function in ( fl~8] ) w.r.t. A, /z, k, respectively, 
are computed as 

-X- [- \ogZ(\,y, k)] 

= 1 2 - E g (/,7 7 ) [y r»2/n/(x„)] + —~~r j neT; (21) 

C — A n 

— [- log Z(X, y, k)] 
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q(f,i 7 ) ] V! Vndn / 7z> 2 £ {^1} > 

Lra:j/„=z J 


= E, 


— [- log Z(A, n , k) 


— ^ in^-) 


E ^ 


, z G {±1} . 


( 22 ) 


(23) 


Note that the expectation w.r.t. q(f,rj) are approximated by 
Gibbs sampling with each conditional distribution given by 
For a detailed implementation of the Gibbs sampler, 
see the Appendix Sec. [D] 


A complete description of algorithm is presented in Algo¬ 
rithm 1. It is remarked that in ( [19] ) the probability of {p n = 0} 
is proportional to the sum of margin of classification and 
negative local entropy value. The role of the dual variables 
(rj n , He) in ( P~9j ) and ( [ 20 ] ) is to balance the classification margin 
y /(•) and local entropy h in determining the anomalies. 


with mean m_i = 
mon covariance E = 


(3,3) 


20 

16 


and 

16 

20 


m + 1 = —ra_i and com- 
. The sample follows the 


log-linear model log p(y,x;Q) oc 1/2t/(w t x + 6) where 
O = (w, b). A Gaussian prior was used as po(Q) = 
•V(w; 0,<4I)JV(6;0 ,(t£). 


We followed the same models as in 0- In particular, the 
anomalies in the training set were drawn uniformly from a 
ring with an inner radius of R and outer radius R + 1, where 
R was assigned as one of the values [15,35,55,75]. Define 
R to be the noise level of the data set, since the larger R the 
higher the discrepancy between the nominal distribution and 
the anomalous distribution. The samples then were labeled as 
{0,1} with equal probability. The size of the training set was 
100 for each class, and the ratio of anomaly samples was r a . 
The test set contained 2000 uncorrupted samples from each 
class. See Fig. [5] (a) for a realization of the data set and the 
classifiers. 


B. Prediction and detection on test samples 


The GEM-MED classifier is similar to the standard MED 
classifier in < 0 : 


y* = argmax^ yf(x m )q(f\f),'D t )df') , 

sign \ ^ ^ f) n \ n y n K(x m: x n ^ ( ^ (24) 

ln£T J 


where fj is the conditional mean estimator of rj given by 
Algorithm 1. 


To find the anomalies given f), the rejection region 
{x n \fj n > := V t C\VL l _p is identified. Then, using this data 

to form a nominal set, for each test sample we compute the 
sum of all k-nearest neighbor distances := dk{x m ^V t D 
Hi_|) relative to V t D A sample x m is declared an 

anomaly if d m > and otherwise it is declared to be 
nominal. Here the threshold d is set using the Leave-One-Out 
resampling approach as described in 0 * 


V. Experiments 

We illustrate the performance of the proposed GEM-MED 
algorithm on simulated data as well as on a real data col¬ 
lected in a field experiment. We compare the proposed GEM¬ 
MED with the SVM implemented by LibSVM (30) and the 
Robust-Outlier-Detection algorithm implemented with code 
obtained from the authors of [7]. For the simulated data 
experiment, a linear kernel SVM is implemented, and for 
the real data, a Gaussian RBF kernel SVM with kernel 
K (x^, xj) = exp(— 7 ||x^ — Xj |||) is implemented and the 
kernel parameter 7 > 0 is tuned via 5-fold-cross validation. 


A. Simulated experiment 

For each class c G {±1}, we generate samples from the 
bivariate Gaussian distribution A/’(m + i, E) and E), 



(b) 

Fig. 5. (a) The classification decision boundary for SVM, ROD and GEM¬ 
MED on the simulated data set with two bivariate Gaussian distribution 
in the center and a set of anomalous samples 
for both classes distributed in a ring. Note that SVM is biased toward the 
anomalies (within outer ring support) and ROD and GEM-MED are insensitive 
to the anomalies, (b) The Illustration of anomaly score rj n for GEM-MED 
and ROD. The GEM-MED is more accurate than ROD in term of anomaly 
detection. 

We first compare the classification accuracy of SVM, 
Robust-Outlier-Detection (ROD) with outlier parameter p and 
GEM-MED, under noise level R and a range of corruption 
rates r a G {0.2, 0.3, 0.4, 0.5}. We used the BP-kNNG imple¬ 
mentation of GEM, where the k-nearest neighbor parameter 
k = 5. In the update of the GEM-MED dual variables 
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Fig. 6. (a) Miss-classification error (%) vs. noise level R for corruption rate r a = 0.2. (b) Miss-classification error (%) vs. corruption rate E [ 77 ] for 

ring-structureed anomaly distribution having ring R = 55. (c) Recall-precision curve for GEM-MED and RODs on simulated data for corruption rate = 0.2. 
(d) The AUC vs. corruption rate r a for GEM-MED and ROD with a range of outlier parameters p. From (a) and (b), GEM-MED outperforms both SVM/MED 
and ROD for various p in classification accuracy. From (c), under the same corruption rate, we see that GEM-MED outperforms ROD in terms of the precision- 
recall behavior. This due to the superiority of GEM constraints in enforcing anomaly penalties into the classifier. From (d), The GEM-MED outperforms 
RODs in terms of AUC for the range of investigated corruption rates. 


(A, /x, k), the learning rate (() is chosen based on a 
comparison of classification performance of the GEM-MED 
under a range of noise levels R and corruption rates r a , as 
shown in Fig.|7](a)-(c). Note that when ip E [1,4] x 10 -3 , ^ £ 
[1,4] x 10 -2 , r G [1, 5] x 10 -2 , the performance of the GEM¬ 
MED is stable in terms of the averaged missclassification 
error and the variance. We fix ( 99 , -0, r) in the stable range 
in the following experiments. For the ROD, we investigated a 
range of algorithm parameters, in particular outlier parameter 
p G {0.02,0.2,0.6} for comparison, and we observed that 
the value p = 0.02 gives the best classification performance 
regardless of the setting of R G {15,35,55,75} or r a G 
{0.2,0.3,0.4,0.5}. Recall that the ROD parameter p is a fixed 
threshold that determines the proportion of anomalies, i.e., 
the proportion of nonzero p n (7). Compared to the ROD, the 
GEM-MED as a Bayesian method requires no tuning parame¬ 
ter to control the proportion of anomalies. In the experiments 


below, we compare the ROD for a range of outlier parameters 
p with GEM-MED for a single choice of which were 

tuned via 5-fold-cross-validation of misclassification rate over 
50 trial runs. 

Fig. [ 6 ja) shows the miss-classification error (%) versus 
various noise level R (with r a = 0.2), and Fig.[ 6 jb) shows the 
miss-classification error under different corruption rate settings 
(with R = 55). In both experiments, GEM-MED outperforms 
ROD and SVM in terms of classification accuracy. Note that 
when the noise level or the corruption rate increases, the 
training data become less representative of the test data and the 
difference between their distributions increases. This causes 
a significant performance deterioration for the SVM/MED 
method, which is demonstrated in Fig. [ 6 ja ) and Fig. | 6 }b). 
Fig. [5] (b) also shows the bias of the SVM classifier towards 
the anomalies that lie in the ring. Comparing to GEM-MED 
and ROD in Fig [ 6 ja) and Fig. | 6 }b), the former method is 
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less sensitive to the anomalies. Moreover, since the GEM¬ 
MED model takes into account the marginal distribution for 
the training sample, it is more adaptive to anomalies in the 
training set, as compared to ROD, which does not use any 
prior knowledge about the nominal distribution but only relies 
on the predefined outlier parameter p to limit the training loss, 

In Fig. 0c) we compare the performance of GEM-MED and 
ROD in terms of precision vs recall for the same corruption 
rate as in Fig. 0a) and 0b). In ROD and GEM-MED, the 
estimated weights rj n G [0,1] for each sample can be used to 
infer the likelihood of anomalies. In particular, in GEM-MED 
the corresponding latent variable estimate fj n is obtained at the 
final iteration of the Gibbs sampling procedure, as described in 
Appendix Sec. [D] Following the anomaly ranking procedure 
in 0, these anomaly scores are placed in ascending order. 
We compute the precision and recall using this ordering by 
averaging over 50 runs. Precision and recall are measures that 
are commonly used in data mining (31): 

Precision - li n : Bn - Pc} H {n : (x n ,j/ n ) are anomalous}] 

\{n : rj n < p c }\ 

Recall _ \{n : r]n < Pc} H {n : (x n ,g/ n ) are anomalous}] 
\{n : (x n , 2 / n ) are anomalous}| 

where the threshold p c is a cut-off threshold that is swept 
over the interval [0,1] to trace out the precision-recall curves 
in Fig. 0c). It is evident from the figure that the proposed 
GEM-MED outlier resistant classifier has better precision- 
recall performance than ROD. Other corruption rates r a lead 
to similar results. In Fig.0d), we compare the performance of 
GEM-MED, RODs under different corruption rates in terms 
of the Area Under the Curve (AUC), a commonly used 
measure in data mining (3TJ. Similar to Fig 0c), the GEM¬ 
MED outperforms RODs in terms of AUC for the range of 
investigated corruption rates. 


B. Footstep classification experiment 

The proposed GEM-MED method was evaluated on experi¬ 
ments on a real data set collected by the U.S. Army Research 
Laboratory [23), (24), (32). This data set contains footstep 
signals recorded by a multisensor system, which includes four 
acoustic sensors and three seismic sensors. All the sensors are 
well-synchronized and operate in a natural environment, where 
the acoustic signal recordings are corrupted by environmental 
noise and intermittent sensor failures. The task is to dis¬ 
criminate between human-alone footsteps and human-leading- 
animal footsteps. We use the signals collected via four acoustic 
sensors (labeled sensor 1,2,3,4) to perform the classification. 
See Fig. 0 Note that the fourth acoustic sensor suffers from 
sensor failure, as evidenced by its very noisy signal record 
(bottom panel of Fig.0. The data set involves 84 human-alone 
subjects and 66 human-leading-animal subjects. Each subject 
contains 24 75%-overlapping sample segments to capture 
temporal localized signal information. We randomly selected 
25 subjects with 600 segments from each class as the training 
set. The test set contains the rest of the subjects. In particular, 


it contains 1416 segments from human-alone subjects and 984 
segments from human-leading-animal subjects. 


First accoustic sensor 



Third accoustic sensor 




i 1.5 ; 

second 

Forth accoustic sensor 



1.5 2 

second 


Fig. 8. A snapshot of human-alone footstep collected by four acoustic 
sensors. 


In a preprocessing step, for each segment, the time interval 
with strongest signal response is identified and signals within a 
fixed size of window (1.5 second) are extracted from the back¬ 
ground. Fig. 0 shows the spectrogram (dB) of human-alone 
footsteps and human-leading-animal footsteps using the short- 
time Fourier transform [33], as a function of time (second) and 
frequency (Hz). The majority of the energy is concentrated in 
the low frequency band and the footstep periods differ between 
these two classes of signals. For features, we extract a mel- 
frequency cepstral coefficient (MFCC, (34| ) vector using a 
50 msec, window. Only the first 13 MFCC coefficients were 
retained, which were experimentally determined to capture an 
average 90% of the power in the associated cepstra. There are 
in total 150 windows for each segment, resulting in a matrix of 
MFCC coefficients of size 13 x 150. We reshaped the matrix 
of MFCC features to obtain a 1950 dimensional feature vector 
for each segment. As in (24), (32), we apply PC A to reduce 
the dimensionality from 1950 to 50, while preserving 85% of 
the total power, 


In Tables [II] and [TIT] we compare the performance of kernel 
SVM, kernel MED, ROD for outlier parameter p G [0.01,1], 
and GEM-MED using four individually as well as in com¬ 
bination. For the combined sensors we used an augmented 
feature vector of dimension 200 via feature concatenation. We 
used a Gaussian RBF kernel function for the matrix K in 
the Gaussian process prior for the SVM decision function /. 
For the optimization of GEM-MED we used a separable prior 
and exponentially distributed hyperparameters, as indicated by 
(T~2] ) and ( p~3] ). Finally, the BP-kNNG implementation of GEM 
was applied on the training samples in the MFCC feature space 
with k = 10 nearest neighbors. The threshold •& is set using the 
Leave-One-Out resampling strategy (9], where each holdout 
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Fig. 7. The classification error of GEM-MED vs. (a) learning rates (p, when = 0.01, r = 0.02); (b) vs. i/j when (ip = 0.001, r = 0.02) and (c) vs. r 
when (cp = 0.001, ^ = 0.01). The vertical dotted line in each plot separates the breakdown region (to the right) and the stable region of misclassification 
performance. These threshold values do not vary significantly as the noise level R and corruption rate r a vary over the ranges investigated. 


sample corresponds to an entire segment. 

Table [II] shows the classification accuracy of the methods 
applied independently to each of the four individual sensors 
and to the combination of all four sensors. For ROD only 
p = 0.02 and p = 0.20 are shown; it was determined that 
p = 0.20 achieves the best performance in the range p G 
[0.01,1]. It is seen that the GEM-MED method outperforms 
the ROD-p algorithms for all values of p as a function of 
classification accuracy when individual sensors 1,2,4 are used. 
Notice that when used alone neither kernel MED nor kernel 
SVM is resistant to the sensor failures in the training set, 
which explains their poor accuracy in sensor 3 and sensor 
4. Moreover, in Table [II] we compare GEM-MED with a 
two-stage procedure that prescreens the SVM by using the 
GEM anomaly detector in [10]. At the first stage, the GEM 
anomaly detector screens out anomalies at false alarm level 
/3 and then at the second stage, the SVM classifier is trained 
on the screened data set. For fair comparison, we reweighted 
each error by the ratio filtered|/|^s|, where \V Si filtered| is 
the size of the screened data set and \V S \ is the number 
of samples in the original test data. Table [II] shows that 
the two stage learning approach has poor performance in 


highly corrupted sensors 3 and 4. This is due to the fact 
that when the GEM detector is learned without inferring the 
classification margin, it cannot effectively limit the negative 
influence of those corrupted samples that are close to the class 
boundary. As a consequence, the classifier is still vulnerable to 
these anomalies. This reflects the superiority of the proposed 
joint classification and detection approach of GEM-MED as 
compared with a standard two-stage approach. 

Table [HI] compares the anomaly detection accuracies for 
ROD and GEM-MED, where the accuracy is computed rel¬ 
ative to ground truth anomalies. Note that GEM-MED has 
significant improvement in accuracy over ROD when trained 
individually on sensors 1,3,4, respectively, and when trained 
on all of the combined sensors. When trained on sensor 
2 alone, the accuracies of GEM-MED and ROD-0.2 are 
essentially equivalent. In sensor 2 the anomalies appear to 
occur in concentrated bursts and we conjecture that that a 
GEM-MED model that accounts for clustered and dependent 
anomalies may be able to do better. Such an extension is left 
to future work. 
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TABLE II 

Classification accuracy for footstep experiment with different sensor combinations, with the best performance shown in bold. 


Classification Accuracy (%) mean db standard error 

sensor no. 

kernel SVM 

kernel MED 

ROD-0.02 

ROD-0.2 

GEM ± SVM 

GEM-MED 

1 

71.2 ±8.2 

71.1 ±5.3 

73.7 ±3.7 

76.0 ±2.5 

72.5 ±4.2 

78.4 ±3.3 

2 

60.8 ± 12.5 

62.3 ±10.2 

71.5 ±7.3 

76.5 ±5.3 

70.3 ± 2.5 

82.1 ±3.1 

3 

60.5 ± 14.2 

60.0 ±13.1 

63.2 ±5.4 

67.6 ±4.2 

56.5 ±3.5 

66.8 ±4.5 

4 

59.6 ± 10.1 

58.4 ±8.2 

71.8 ±7.2 

73.2 ±4.2 

76.5 ±2.7 

80.1 ±3.1 

1,2,3,4 

75.9 ±7.5 

78.6 ±5.1 

79.2 ±3.7 

79.8 ±2.5 

75.2 ±3.3 

84.0 ±2.3 



Time (Seconds) 


(b) 

Fig. 9. The power spectrogram (dB) vs. time (sec.) and frequency (Hz.) for 
a human-alone footstep (a) and a human-leading-animal footstep (b). Observe 
that the period of periodic footstep is a discriminative feature that separates 
these two signals. 

VI. Conclusion 


In this paper we proposed a unified GEM-MED approach 
for anomaly-resistant classification. We demonstrated its per¬ 
formance advantages in terms of both classification accuracy 
and detection rate on a simulated data set and on a real 
footstep data set, as compared to an anomaly-blind Ramp- 
Loss-based classification method (ROD). Further work could 
include generalization to the setting of multiple sensor types 
where anomalies exist in both training and test sets. 


TABLE III 

Anomaly detection accuracy with different sensors, with the 

BEST PERFORMANCE SHOWN IN BOLD. 


Anomaly Detection Accuracy (%) mean ± standard error 

sensor no. 

ROD-0.02 

ROD-0.2 

GEM-MED 

1 

30.2 ± 1.3 

59.0 ±3.5 

70.5 ±1.3 

2 

23.5 ±2.6 

63.5 ±2.8 

63.4 ±2.5 

3 

5.3 ±1.4 

48.1 ±3.3 

72.8 ± 1.5 

4 

22.8 ±3.2 

65.2 ±4.2 

88.1 ±2.1 

1,2,3,4 

38.5 ±6.3 

63.3 ±5.5 

88.5 ±4.1 
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Appendix 


A. Derivation of result 4.1 


Proof: The Lagrangian function is given as 

= [log 9 - logpo] + X iVn^c] - X 


neT 


^{±1} 


£D,2 


*e{±i} 


X Wl T\-P 


Jl'-Vn—Z 


with dual variables A = {A n ,n gT} ^ 0, g = (/i^z E 
±1) X 0 and v > 0. 

Then the result follows directly from solving a system of 
equations according to the KKT condition. 
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B. Derivation of result 


D. Implementation of Gibbs sampler 


Proof: According to (2j, the dual optimization is given as 
max — logZ(A, h,k) 

A,/i,,K.>0 

= log / exp ( c( 1 ^ n ) A n £ n ) d^ n 

neT J 

X j y exp ^-^f T K~ 1 f + ^2x n r] n y n f r ^j df 
x po(t}) exp - E m*E Vnd n + ^2 Pzlz 

\ zE{± 1} ri:z ze{±l} 


We implement a Gibbs sampler |35| to estimate 
E g (/ ?r? ) [G(/, 77)], where G is a general function of / and 77, 
as expressed in <(2TJ, ((22)1, The following procedure is 
applied iteratively 

• Initialization: Set f) 0 = [1,..., 1] T and set a fixed dual 
parameter (A, /z, «). Let Go = 0. 

• For each t = 1, 2,..., Tq or until convergence 

1) Given r) t _ 1 = (fj n j-i), generate decision value 

/t(x n ),n = 1,...,7V according to the Gaus¬ 

sian process ( [2Q| ) with mean function f t f) = 

ZlnET A n 7)n,t—l2/n^('j *^n)- 

2) Given {/t(x n )} 1 < n < Ar , for r = 1,..., 7V r , 


+ E k -E k z (3 I (*7 

2;G{±1} n:z zE{± 1} / 

= E + l0g 0 “ X n/c)) ~ 

ner ze{± 1} 

- log J exp (A © rj © y)) + rp (-ju 0 d + k 0 e) 

x Po(r])dr] 


where 


a) generate latent variables r^\ G {0,1} according 
to the Bernoulli distribution with parameter as 
in © for each n independently. 

3) Compute the sample mean of fj n j = Vnt 

G [0,1], 72 = 1,..., TV. Let fj t = ( 7 } n ,t)i r < n <Ar. 

4) Evaluate G t via G t = ^G t _i + \G{J u fi t ) 

Output the approximate expectation [G(/, 77 )] = 

Gt as well as the mean estimate fj T and fri^n), 1 < 
n < N when the Gibbs chain process becomes stationary. 


Q(K. x ) = x t Kx 

Q(K, (A © 77 )) := (A © rjf K (A © 77 ) 

= A T (K©(7777 T ))A [1] 

= Q(K( 77 ), A). PI 

[3] 


C. Derivation of ( [19] ), ( |20| ) [4] 

Proof: The expression for q{&) is given as [ 5 ] 


q(@) (X exp [ ff T K ' / + E ^nVnUnfn 


: Po(t]) exp - E m*E Vndn + E «*E Vn 

\ zE{±l} ri:z ze{± 1} ™z 

: P[ exp ( c + (c - A„)£„) 


neT 


= Q(f,’n)Y[Q(0 

n 

Given all r ] n , n G T, 

<?(/|77 ) oc exp + ^(A n r ? n )/n j 

= exp (^(/-K(A0r / 07/)) T K- 1 (/-K(A0r / 07/)| 


[ 6 ] 

[7] 

[ 8 ] 

[9] 

[ 10 ] 

[ 11 ] 

[42] 


= A'(K(AOr 7 © 2 /),K). 


[14] 


On the other hand, given /, 77 = (p n: n G T) are fully [15] 
separated in above formula, therefore < 7 ( 77 1/) = 
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